Dunnett’s Test — Multiple Comparisons vs a Control#
Dunnett’s test is designed for experiments where you compare several treatments to a single control and want to control the family-wise error rate (FWER).
What you’ll learn#
When Dunnett’s test is the right tool (and when it isn’t)
The test statistic and the “multiple comparisons vs control” adjustment
How to interpret adjusted p-values and simultaneous confidence intervals
A NumPy-only implementation (Monte Carlo approximation) with Plotly visuals
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
# Plotly rendering (CKC convention)
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
# Reproducibility
rng = np.random.default_rng(7)
np.set_printoptions(precision=4, suppress=True)
1) When to use Dunnett’s test#
Use Dunnett’s test when:
You have one control group and m treatment groups.
You care about treatment vs control comparisons (not every pair).
You want to control the probability of any false positive across those m tests (FWER ≤ α).
Typical examples:
Dose–response studies (multiple doses vs placebo)
A/B/n experiments with a baseline
Regression testing (new versions vs current control)
Not the right tool if:
You want all pairwise comparisons (use Tukey’s HSD).
Variances differ strongly between groups (consider Welch-type approaches).
2) What problem it solves (multiple testing)#
For each treatment group i (compared to control c), the hypotheses are:
Two-sided:
\(H_{0,i}: \mu_i = \mu_c\) vs \(H_{1,i}: \mu_i \ne \mu_c\)
One-sided (greater):
\(H_{0,i}: \mu_i \le \mu_c\) vs \(H_{1,i}: \mu_i > \mu_c\)
If you run m separate tests at level \(\alpha\), the chance of getting at least one false positive grows with m. Under independence (a useful intuition),
\(\mathrm{FWER} = 1 - (1 - \alpha)^m.\)
Dunnett’s test chooses a single critical value so that the maximal evidence against \(H_0\) across treatments triggers a rejection only with probability \(\alpha\) under the global null.
alpha = 0.05
m = np.arange(1, 21)
fwer_independent = 1 - (1 - alpha) ** m
fig = px.line(
x=m,
y=fwer_independent,
markers=True,
labels={"x": "Number of comparisons (m)", "y": "FWER (independence intuition)"},
title="Why multiple t-tests inflate false positives",
)
fig.add_hline(y=alpha, line_dash="dash", annotation_text="Target α", annotation_position="bottom right")
fig.show()
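To preview the fix before the formal treatment, here is a minimal sketch with independent z statistics (it ignores the shared-control correlation, so it only illustrates the principle): calibrate one critical value for the maximum statistic, instead of testing each statistic at its own level.
# Sketch: one cutoff for max |Z| over m independent statistics, vs the
# per-test cutoff. The per-test cutoff lets the family-wise rate inflate.
demo_rng = np.random.default_rng(0)  # local generator; leaves `rng` untouched
m_demo = 3
zs = demo_rng.standard_normal((200_000, m_demo))
max_abs_z = np.abs(zs).max(axis=1)
c_max = np.quantile(max_abs_z, 1 - alpha)  # calibrated for the maximum
c_per_test = np.quantile(np.abs(demo_rng.standard_normal(200_000)), 1 - alpha)
print(f"per-test cutoff {c_per_test:.3f} -> FWER {(max_abs_z > c_per_test).mean():.3f}")
print(f"max-stat cutoff {c_max:.3f} -> FWER {(max_abs_z > c_max).mean():.3f}")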
3) Model & assumptions#
Dunnett’s test is built on the one-way ANOVA model
\(y_{ij} = \mu_i + \varepsilon_{ij}, \qquad \varepsilon_{ij} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2),\)
where \(i\) indexes the groups (control and treatments) and \(j\) the observations within a group.
Assumptions (the same spirit as pooled-variance ANOVA):
observations are independent
residuals are approximately normal (often reasonably robust for moderate n)
homoscedasticity: all groups share the same variance \(\sigma^2\)
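Before pooling variances, it is worth a quick informal look at the equal-variance assumption. Here is a minimal sketch of a helper (a heuristic, not a formal test; a max/min sd ratio far above ~2 is a common rule-of-thumb warning sign):
# Informal homoscedasticity screen: compare per-group sample sds.
def spread_by_group(groups: dict) -> dict:
    '''Report each group's sample sd; crude screen for unequal variances.'''
    sds = {g: float(np.std(np.asarray(x, dtype=float), ddof=1)) for g, x in groups.items()}
    for g, s in sds.items():
        print(f"{g:<10} n={np.asarray(groups[g]).size:>3}  sd={s:.3f}")
    vals = np.array(list(sds.values()))
    print(f"max/min sd ratio: {vals.max() / vals.min():.2f}")
    return sds
You can call spread_by_group(groups) on the worked-example data further below.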
4) Test statistic (pooled-variance, ANOVA-style)#
Let:
control size \(n_c\) and treatment size \(n_i\)
sample means \(\bar y_c\) and \(\bar y_i\)
pooled within-group mean square \(\mathrm{MSE}\) (from one-way ANOVA)
degrees of freedom \(\mathrm{df} = N_{\text{total}} - k\) where \(k\) is the number of groups
Then for each treatment i:
\(T_i = \dfrac{\bar y_i - \bar y_c}{\sqrt{\mathrm{MSE}\left(\frac{1}{n_i} + \frac{1}{n_c}\right)}}\)
Dunnett’s test differs from running m t-tests not by the statistic, but by the multiple-comparison adjustment used for decisions, p-values, and confidence intervals.
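As a tiny worked check of this statistic, here is a sketch with two made-up groups (the numbers are arbitrary):
# Hand-compute the pooled-variance t statistic for one treatment vs control.
y_c = np.array([0.1, -0.4, 0.3, 0.0, -0.2])
y_1 = np.array([0.9, 0.4, 1.1, 0.6])
sse = np.sum((y_c - y_c.mean()) ** 2) + np.sum((y_1 - y_1.mean()) ** 2)
df_demo = y_c.size + y_1.size - 2  # N_total - k, with k = 2 groups
mse_demo = sse / df_demo
se_demo = np.sqrt(mse_demo * (1 / y_1.size + 1 / y_c.size))
print(f"diff={y_1.mean() - y_c.mean():.4f}  se={se_demo:.4f}  t={(y_1.mean() - y_c.mean()) / se_demo:.4f}")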
5) Correlation structure (why “Dunnett” is special)#
For two different treatments i and j, the differences \(\bar y_i - \bar y_c\) and \(\bar y_j - \bar y_c\) share the same control mean. That induces positive correlation.
A convenient approximation for the correlation of the standardized mean differences is:
\(\mathrm{Corr}(T_i, T_j) = \dfrac{1}{\sqrt{\left(1 + n_c/n_i\right)\left(1 + n_c/n_j\right)}}, \qquad i \ne j.\)
Balanced design example: if \(n_c = n_i = n\), then \(\mathrm{Corr}(T_i, T_j) = 1/2\).
6) NumPy-only Dunnett via Monte Carlo (educational, but accurate with enough samples)#
The exact Dunnett p-values/critical values come from a multivariate t distribution. Instead of calling a stats library, we’ll simulate it directly:
Build the correlation matrix \(R\) of the treatment-vs-control statistics.
Sample \(Z \sim \mathcal{N}(0, R)\).
Sample an independent \(U \sim \chi^2_{\mathrm{df}}\).
Form \(T = Z / \sqrt{U/\mathrm{df}}\) (this is multivariate t).
Use the distribution of \(\max_i |T_i|\) to get a critical value and adjusted p-values.
This is the core principle behind Dunnett’s adjustment.
from typing import Dict, Literal, Tuple
Alternative = Literal["two-sided", "greater", "less"]
def pooled_mse(groups: Dict[str, np.ndarray]) -> Tuple[float, int]:
'''Pooled within-group mean square (ANOVA MSE).
MSE = SSE / df, where SSE = sum_g sum_j (y_gj - mean_g)^2
and df = N_total - k_groups.
'''
if len(groups) < 2:
raise ValueError("Need at least two groups (including control).")
sse = 0.0
n_total = 0
for x in groups.values():
x = np.asarray(x, dtype=float)
if x.ndim != 1:
raise ValueError("Each group must be a 1D array.")
n_total += x.size
sse += float(np.sum((x - x.mean()) ** 2))
k = len(groups)
df = n_total - k
if df <= 0:
raise ValueError("Not enough total samples to estimate pooled variance.")
return sse / df, int(df)
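# Quick sanity check (a sketch): with two equal-size groups, the pooled MSE
# should equal the average of the two sample variances (ddof=1).
chk_rng = np.random.default_rng(0)  # local generator; leaves `rng` untouched
a = chk_rng.normal(size=10)
b = chk_rng.normal(size=10)
mse_chk, df_chk = pooled_mse({"a": a, "b": b})
print(np.isclose(mse_chk, (a.var(ddof=1) + b.var(ddof=1)) / 2), df_chk)  # expect: True 18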
def dunnett_corr(n_control: int, n_treatments: np.ndarray) -> np.ndarray:
'''Correlation matrix for treatment-vs-control standardized mean differences.'''
n_treatments = np.asarray(n_treatments, dtype=float)
if np.any(n_treatments <= 0) or n_control <= 0:
raise ValueError("Sample sizes must be positive.")
a = 1.0 + (n_control / n_treatments)
corr = 1.0 / np.sqrt(np.outer(a, a))
np.fill_diagonal(corr, 1.0)
return corr
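# Check the balanced-design claim from the previous section: equal sample
# sizes give off-diagonal correlation exactly 1/2.
print(dunnett_corr(n_control=10, n_treatments=np.array([10, 10, 10])))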
def dunnett_t_stats(
groups: Dict[str, np.ndarray],
control: str,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, float, int]:
'''Compute Dunnett t-statistics vs control.
Returns (treat_names, diffs, t_stats, mse, df)
'''
if control not in groups:
raise KeyError(f"Control group '{control}' not found.")
mse, df = pooled_mse(groups)
names = [g for g in groups.keys() if g != control]
if len(names) == 0:
raise ValueError("Need at least one treatment group besides control.")
y_c = np.asarray(groups[control], dtype=float)
n_c = y_c.size
mean_c = float(y_c.mean())
diffs = []
t_stats = []
for g in names:
y = np.asarray(groups[g], dtype=float)
n_i = y.size
diff = float(y.mean()) - mean_c
se = float(np.sqrt(mse * (1.0 / n_i + 1.0 / n_c)))
diffs.append(diff)
t_stats.append(diff / se)
return np.array(names), np.array(diffs, dtype=float), np.array(t_stats, dtype=float), float(mse), int(df)
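# Minimal usage sketch for dunnett_t_stats on made-up data (values arbitrary).
demo_groups = {
    "ctrl": np.array([0.0, 0.2, -0.1, 0.4, 0.1]),
    "t1": np.array([0.5, 0.9, 0.7, 0.6]),
    "t2": np.array([0.1, -0.2, 0.0, 0.3]),
}
names_d, diffs_d, t_d, mse_d, df_d = dunnett_t_stats(demo_groups, control="ctrl")
print(names_d, diffs_d.round(3), t_d.round(3), f"df={df_d}")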
def simulate_max_stat(
df: int,
corr: np.ndarray,
n_sim: int,
alternative: Alternative,
rng: np.random.Generator,
) -> np.ndarray:
'''Simulate the max statistic used by Dunnett’s adjustment.'''
corr = np.asarray(corr, dtype=float)
if corr.ndim != 2 or corr.shape[0] != corr.shape[1]:
raise ValueError("corr must be a square matrix")
m = corr.shape[0]
if m < 1:
raise ValueError("Need at least one treatment.")
L = np.linalg.cholesky(corr)
z = rng.standard_normal((n_sim, m)) @ L.T
u = rng.chisquare(df, size=n_sim) / df
t = z / np.sqrt(u)[:, None]
if alternative == "two-sided":
return np.max(np.abs(t), axis=1)
if alternative == "greater":
return np.max(t, axis=1)
if alternative == "less":
return np.max(-t, axis=1)
raise ValueError("alternative must be 'two-sided', 'greater', or 'less'.")
def dunnett_test(
groups: Dict[str, np.ndarray],
control: str,
alpha: float = 0.05,
alternative: Alternative = "two-sided",
n_sim: int = 200_000,
seed: int = 7,
) -> Dict[str, object]:
'''Monte Carlo Dunnett test (NumPy-only).
Returns a dict containing:
- names: treatment names
- diffs: mean(treatment) - mean(control)
- t: t-statistics
- p_adj: Dunnett-adjusted p-values (single-step)
    - ci_low, ci_high: simultaneous CIs for mean differences (symmetric form, calibrated for the two-sided alternative)
- crit: Dunnett critical value for the max statistic
- mse, df: pooled variance estimate and df
- corr: correlation matrix among T_i
'''
if not (0 < alpha < 1):
raise ValueError("alpha must be in (0,1).")
names, diffs, t_stats, mse, df = dunnett_t_stats(groups, control)
n_c = int(np.asarray(groups[control]).size)
n_t = np.array([int(np.asarray(groups[g]).size) for g in names], dtype=int)
corr = dunnett_corr(n_c, n_t)
sim_rng = np.random.default_rng(seed)
max_stat = simulate_max_stat(df=df, corr=corr, n_sim=n_sim, alternative=alternative, rng=sim_rng)
crit = float(np.quantile(max_stat, 1 - alpha))
if alternative == "two-sided":
t_for_p = np.abs(t_stats)
elif alternative == "greater":
t_for_p = t_stats
else:
t_for_p = -t_stats
p_adj = np.array([(max_stat >= float(t0)).mean() for t0 in t_for_p], dtype=float)
    # Simultaneous intervals diffs ± crit*se: coverage is exact (up to Monte
    # Carlo error) for the two-sided alternative; for one-sided alternatives,
    # only the corresponding bound (e.g., ci_low under "greater") holds.
se = np.sqrt(mse * (1.0 / n_t + 1.0 / n_c))
ci_low = diffs - crit * se
ci_high = diffs + crit * se
reject = p_adj < alpha
return {
"names": names,
"diffs": diffs,
"t": t_stats,
"p_adj": p_adj,
"ci_low": ci_low,
"ci_high": ci_high,
"reject": reject,
"crit": crit,
"mse": mse,
"df": df,
"corr": corr,
"max_stat_sim": max_stat,
"alternative": alternative,
"alpha": alpha,
}
7) Worked example (with visuals)#
We’ll simulate a small experiment with one control and three treatments. Two treatments have real effects, one does not.
groups = {
"Control": rng.normal(loc=0.0, scale=1.0, size=22),
"Dose 1": rng.normal(loc=0.25, scale=1.0, size=20),
"Dose 2": rng.normal(loc=0.75, scale=1.0, size=18),
"Dose 3": rng.normal(loc=0.00, scale=1.0, size=24),
}
{g: (x.size, float(x.mean())) for g, x in groups.items()}
{'Control': (22, -0.3820492834449359),
'Dose 1': (20, -0.11140381300226458),
'Dose 2': (18, 0.9206856310625974),
'Dose 3': (24, -0.08548638452023795)}
fig = go.Figure()
for name, x in groups.items():
fig.add_trace(go.Violin(y=x, name=name, box_visible=True, meanline_visible=True))
fig.update_layout(
title="Simulated data by group",
yaxis_title="Outcome",
violinmode="group",
)
fig.show()
res = dunnett_test(groups, control="Control", alpha=0.05, alternative="two-sided", n_sim=250_000, seed=1)
res["crit"], res["df"], res["mse"]
(2.392486887925362, 80, 0.7378219350831077)
def print_results(res: Dict[str, object]) -> None:
names = res["names"]
diffs = res["diffs"]
t = res["t"]
p = res["p_adj"]
lo = res["ci_low"]
hi = res["ci_high"]
reject = res["reject"]
header = f"{'Treatment':<12} {'Diff':>9} {'t':>9} {'p_adj':>10} {'CI_low':>10} {'CI_high':>10} {'Reject':>8}"
print(header)
print("-" * len(header))
for i in range(len(names)):
print(
f"{str(names[i]):<12} "
f"{diffs[i]:9.4f} "
f"{t[i]:9.4f} "
f"{p[i]:10.4f} "
f"{lo[i]:10.4f} "
f"{hi[i]:10.4f} "
f"{str(bool(reject[i])):>8}"
)
print_results(res)
Treatment Diff t p_adj CI_low CI_high Reject
--------------------------------------------------------------------------
Dose 1 0.2706 1.0198 0.6145 -0.3643 0.9056 False
Dose 2 1.3027 4.7720 0.0000 0.6496 1.9559 True
Dose 3 0.2966 1.1697 0.5124 -0.3100 0.9031 False
names = res["names"]
diffs = res["diffs"]
lo = res["ci_low"]
hi = res["ci_high"]
reject = res["reject"]
colors = np.where(reject, "#d62728", "#1f77b4")
fig = go.Figure()
fig.add_trace(
go.Bar(
x=names,
y=diffs,
marker_color=colors,
error_y=dict(type="data", array=hi - diffs, arrayminus=diffs - lo),
name="Mean difference",
)
)
fig.add_hline(y=0, line_dash="dash")
fig.update_layout(
title="Treatment − Control mean differences with Dunnett simultaneous CI",
yaxis_title="Mean difference",
)
fig.show()
corr = res["corr"]
fig = px.imshow(
corr,
text_auto=".2f",
color_continuous_scale="Blues",
title="Correlation among treatment-vs-control t-statistics",
labels=dict(x="Treatment", y="Treatment", color="corr"),
)
fig.update_xaxes(ticktext=res["names"], tickvals=list(range(len(res["names"]))))
fig.update_yaxes(ticktext=res["names"], tickvals=list(range(len(res["names"]))))
fig.show()
max_stat = res["max_stat_sim"]
crit = res["crit"]
obs = np.abs(res["t"])
fig = go.Figure()
fig.add_trace(go.Histogram(x=max_stat, nbinsx=60, name="Simulated max |T|", opacity=0.75))
fig.add_vline(x=crit, line_dash="dash", line_color="black", annotation_text=f"crit ≈ {crit:.3f}")
for t0 in obs:
fig.add_vline(x=float(t0), line_color="red", opacity=0.35)
fig.update_layout(
title="Dunnett null distribution: max |T| (Monte Carlo)",
xaxis_title="max |T|",
yaxis_title="count",
)
fig.show()
8) How to interpret Dunnett results#
For each treatment vs control:
Adjusted p-value: the probability (under the global null) that any comparison would look at least this extreme.
If p_adj < α, you can claim that the treatment differs from control while controlling the FWER.
Sign of the difference / t-statistic: direction of the effect (treatment higher vs lower than control).
Simultaneous confidence interval: a set of intervals for all treatment–control differences such that all of them are correct at once with probability \(1-\alpha\). If the CI excludes 0, it matches the “reject” decision.
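This agreement is easy to verify on res from the worked example above; the two can disagree only through Monte Carlo discreteness right at the boundary:
# Two-sided case: the CI excludes 0 exactly when p_adj < alpha.
excludes_zero = (res["ci_low"] > 0) | (res["ci_high"] < 0)
print(excludes_zero, res["reject"], np.array_equal(excludes_zero, res["reject"]))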
What it does not mean:
p_adj = 0.03 is not the probability that the null is true.
A non-rejection does not prove equality; it means there is not enough evidence given the noise and the sample sizes.
9) Quick insight: FWER control (simulation)#
Below we simulate the global null and compare how often we get at least one false positive.
Naive: compare each treatment to control at level \(\alpha\).
Bonferroni: compare each at level \(\alpha/m\).
Dunnett: use the max-|T| critical value.
Everything uses the same df and correlation induced by the shared control.
def simulate_abs_t(df: int, n_sim: int, rng: np.random.Generator) -> np.ndarray:
z = rng.standard_normal(n_sim)
u = rng.chisquare(df, size=n_sim) / df
return np.abs(z / np.sqrt(u))
m = len(res["names"])
df = int(res["df"])
sim_rng = np.random.default_rng(123)
abs_t = simulate_abs_t(df=df, n_sim=400_000, rng=sim_rng)
crit_unadj = float(np.quantile(abs_t, 1 - alpha))
crit_bonf = float(np.quantile(abs_t, 1 - alpha / m))
crit_dun = float(res["crit"])
crit_unadj, crit_bonf, crit_dun
(1.9909334640191727, 2.4425017266737616, 2.392486887925362)
# Simulate multivariate t under the global null
n_exp = 80_000
sim_rng = np.random.default_rng(456)
L = np.linalg.cholesky(res["corr"])
z = sim_rng.standard_normal((n_exp, m)) @ L.T
u = sim_rng.chisquare(df, size=n_exp) / df
T = z / np.sqrt(u)[:, None]
any_unadj = (np.abs(T) > crit_unadj).any(axis=1)
any_bonf = (np.abs(T) > crit_bonf).any(axis=1)
any_dun = (np.abs(T) > crit_dun).any(axis=1)
fwer = {
"Naive (α per test)": float(any_unadj.mean()),
"Bonferroni": float(any_bonf.mean()),
"Dunnett": float(any_dun.mean()),
}
fig = px.bar(
x=list(fwer.keys()),
y=list(fwer.values()),
labels={"x": "Method", "y": "Empirical FWER under global null"},
title=f"FWER control at α={alpha} (Monte Carlo)",
)
fig.add_hline(y=alpha, line_dash="dash", annotation_text="Target α", annotation_position="bottom right")
fig.update_yaxes(range=[0, max(list(fwer.values()) + [alpha]) * 1.25])
fig.show()
10) Pitfalls & diagnostics#
Heteroscedasticity (unequal variances) can distort pooled-variance methods; check residual spread by group (e.g., with the spread_by_group sketch above).
Non-normality can matter for small n; use residual plots and consider robust/nonparametric alternatives.
Only vs control: if you later start caring about treatment-vs-treatment, switch to an all-pairs method.
Power: Dunnett is usually more powerful than Bonferroni at the same FWER because it exploits the correlation induced by the shared control (see the sketch below).
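A quick numerical look at that power claim, in terms of critical values (a sketch with illustrative sizes; a lower cutoff at the same FWER means more power):
# Dunnett vs Bonferroni two-sided critical values as m grows (df=60, n=12).
demo_rng = np.random.default_rng(5)
df_demo, n = 60, 12
abs_t_ref = np.abs(demo_rng.standard_normal(400_000)
                   / np.sqrt(demo_rng.chisquare(df_demo, size=400_000) / df_demo))
for m_demo in [2, 4, 8]:
    corr_demo = dunnett_corr(n, np.full(m_demo, n))
    mx = simulate_max_stat(df=df_demo, corr=corr_demo, n_sim=100_000,
                           alternative="two-sided", rng=demo_rng)
    print(f"m={m_demo}: Dunnett {np.quantile(mx, 0.95):.3f}  "
          f"Bonferroni {np.quantile(abs_t_ref, 1 - 0.05 / m_demo):.3f}")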
Practical workflow:
Inspect plots (group distributions, residuals).
If assumptions look reasonable, run Dunnett.
Report mean differences with simultaneous CIs and adjusted p-values.
11) Exercises#
Increase n_sim and see how stable the critical value res["crit"] becomes.
Make the design balanced (same n for all groups) and verify that the off-diagonal correlation is close to 0.5.
Change alternative to "greater" and interpret the results when effects are negative.
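A starting point for the first exercise (a sketch; run time grows with n_sim, and the seed here is arbitrary):
# Stability of the Monte Carlo critical value as n_sim grows.
for n_sim in [10_000, 50_000, 250_000]:
    r = dunnett_test(groups, control="Control", n_sim=n_sim, seed=42)
    print(f"n_sim={n_sim:>7}: crit = {r['crit']:.4f}")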
References#
Dunnett, C. W. (1955). A multiple comparison procedure for comparing several treatments with a control. Journal of the American Statistical Association, 50(272), 1096–1121.
Many statistical software packages implement exact Dunnett p-values via multivariate t CDFs; this notebook uses Monte Carlo to keep the mechanics transparent.